library(sf)
library(dplyr)
library(leaflet)
Working with Geo database
What is in a .gdb?
Geodatabase (.gdb) is a modern spatial storage format developed by ESRI. It is a folder that contains many files and subfolders. A geodatabase (.gdb) is not just a file, but a collection of various types of data like feature classes (points, lines, polygons), raster data, tables, and more. It’s a robust and scalable way to store spatial data, offering advantages like spatial indexing, topologies, and network datasets for complex spatial analysis.
As a an example lets use the US Census Beureau’s 2022 Cartographic Boundary Files. We’ve downloaded the geodatabse file and now lets do a quick EDA!
Working with Geodatabase (.gdb) in R
Before we start working we’ll load our libraries: We’ll use just three packages sf
for spatial data, dplyr
for data manipulation and leaflet
for visualization.
Step 1: See what layers are in the .gdb
A geodatabase (.gdb) is not just a file, but a collection of various types of data like feature classes (points, lines, polygons), raster data, tables, and more. It’s a robust and scalable way to store spatial data, offering advantages like spatial indexing, topologies, and network datasets for complex spatial analysis.
The first step is to see what ‘layers’ are available in the .gdb.
## Declare path to .gdb
= 'cb_2022_us_all_20m.gdb'
path_gdb
## Examine layers
::st_layers(path_gdb) sf
Driver: OpenFileGDB
Available layers:
layer_name geometry_type features fields crs_name
1 cb_2022_us_cd118_20m Multi Polygon 437 9 NAD83
2 cb_2022_us_county_20m Multi Polygon 3222 12 NAD83
3 cb_2022_us_division_20m Multi Polygon 9 8 NAD83
4 cb_2022_us_nation_20m Multi Polygon 1 3 NAD83
5 cb_2022_us_region_20m Multi Polygon 4 8 NAD83
6 cb_2022_us_state_20m Multi Polygon 52 9 NAD83
We can see that this geo-database contains six layers; let just work with two of then county and state focusing on Pennsylvania.
Step 2: Import State and County boundaries (PA)
The sf package in R is a versatile tool for anyone working with spatial data. It simplifies the management and analysis of geospatial information, supporting common formats like shapefiles and geodatabases. With sf, users can seamlessly integrate spatial data with R’s powerful data analysis capabilities, making complex spatial tasks both accessible and efficient.
Lets import the boundaries as sf
objects.
## Import state
= sf::st_read(path_gdb,"cb_2022_us_state_20m") %>%
sf_state ::st_transform("+proj=longlat +datum=NAD83") sf
Reading layer `cb_2022_us_state_20m' from data source
`D:\git\analytics-corner\pages\blog\issues\999 - working with gdb\cb_2022_us_all_20m.gdb'
using driver `OpenFileGDB'
Simple feature collection with 52 features and 9 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -179.1743 ymin: 17.91377 xmax: 179.7739 ymax: 71.35256
Geodetic CRS: NAD83
## Check data structure
glimpse(sf_state)
Rows: 52
Columns: 10
$ STATEFP <chr> "48", "06", "21", "13", "55", "41", "29", "51", "47", "22", "~
$ STATENS <chr> "01779801", "01779778", "01779786", "01705317", "01779806", "~
$ AFFGEOID <chr> "0400000US48", "0400000US06", "0400000US21", "0400000US13", "~
$ GEOID <chr> "48", "06", "21", "13", "55", "41", "29", "51", "47", "22", "~
$ STUSPS <chr> "TX", "CA", "KY", "GA", "WI", "OR", "MO", "VA", "TN", "LA", "~
$ NAME <chr> "Texas", "California", "Kentucky", "Georgia", "Wisconsin", "O~
$ LSAD <chr> "00", "00", "00", "00", "00", "00", "00", "00", "00", "00", "~
$ ALAND <dbl> 676685555821, 403673617862, 102266581101, 149486268417, 14029~
$ AWATER <dbl> 18974391187, 20291712025, 2384240769, 4418716153, 29343193162~
$ SHAPE <MULTIPOLYGON [°]> MULTIPOLYGON (((-106.6234 3..., MULTIPOLYGON (((~
## Map to check
%>%
sf_state leaflet() %>%
addTiles() %>%
addPolygons(
label = ~NAME, # Custom label with multiple details
labelOptions = labelOptions(
html = T,
style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "13px",
direction = "auto"
) )
## Import county
= sf::st_read(path_gdb,"cb_2022_us_county_20m") %>%
sf_county ::st_transform("+proj=longlat +datum=NAD83") sf
Reading layer `cb_2022_us_county_20m' from data source
`D:\git\analytics-corner\pages\blog\issues\999 - working with gdb\cb_2022_us_all_20m.gdb'
using driver `OpenFileGDB'
Simple feature collection with 3222 features and 12 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -179.1743 ymin: 17.91377 xmax: 179.7739 ymax: 71.35256
Geodetic CRS: NAD83
## Check data structure
glimpse(sf_county)
Rows: 3,222
Columns: 13
$ STATEFP <chr> "17", "27", "37", "47", "06", "17", "19", "22", "18", "33",~
$ COUNTYFP <chr> "127", "017", "181", "079", "021", "093", "095", "003", "05~
$ COUNTYNS <chr> "01784730", "00659454", "01008591", "01639755", "00277275",~
$ AFFGEOID <chr> "0500000US17127", "0500000US27017", "0500000US37181", "0500~
$ GEOID <chr> "17127", "27017", "37181", "47079", "06021", "17093", "1909~
$ NAME <chr> "Massac", "Carlton", "Vance", "Henry", "Glenn", "Kendall", ~
$ NAMELSAD <chr> "Massac County", "Carlton County", "Vance County", "Henry C~
$ STUSPS <chr> "IL", "MN", "NC", "TN", "CA", "IL", "IA", "LA", "IN", "NH",~
$ STATE_NAME <chr> "Illinois", "Minnesota", "North Carolina", "Tennessee", "Ca~
$ LSAD <chr> "06", "06", "06", "06", "06", "06", "06", "15", "06", "06",~
$ ALAND <dbl> 614218330, 2230473967, 653701481, 1455320362, 3403160299, 8~
$ AWATER <dbl> 12784614, 36173451, 42190675, 81582236, 33693344, 5136525, ~
$ SHAPE <MULTIPOLYGON [°]> MULTIPOLYGON (((-88.92876 3..., MULTIPOLYGON (~
## Map to check
%>%
sf_county leaflet() %>%
addTiles() %>%
addPolygons(
label = ~NAME, # Custom label with multiple details
labelOptions = labelOptions(
html = T,
style = list("font-weight" = "normal", padding = "3px 8px"),
textsize = "13px",
direction = "auto"
) )
Step 3: Wrangling to focus on PA
Once we have imported the boundaries as sf
objects we can start to leverage tidyverse tools to do wrangling or analysis. As a simple walkthrough lets do the following:
- filter boundaries for PA
- operationalize some nicer labels for counties
- append some data for mapping
## Filter for PA and op. labels
= sf_county %>%
sf_county_pa filter(STUSPS == 'PA')
## Add some data
library(arrow)
= arrow::read_parquet("df_demographics.parquet") %>%
df_demographics filter(geo == 'county',
== 2020) %>%
year select(GEOID = geoid, median_age)
= sf_county_pa %>%
sf_county_pa left_join(df_demographics, by = 'GEOID')
## Op. labels
= sf_county_pa %>%
sf_county_pa rowwise() %>%
mutate(
county_age_label = paste0(NAME, '<br>',
'Median Age: ', median_age) %>%
::HTML()
htmltools%>%
) ungroup()
Summary
In this post we learned how to work with geodatabases in R. We used the sf
package to import the boundaries and then used leaflet
to visualize the data.